The dataset in this report contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
There are 113,937 rows in this dataset but only 113,066 unique values of ListingKey (fewer than the number of rows), which should be checked before further analysis.
So I counted the occurrences of each ListingKey; the top 5 ListingKeys that appear more than once in the dataset are:
## # A tibble: 5 x 2
## ListingKey n
## <chr> <int>
## 1 17A93590655669644DB4C06 6
## 2 349D3587495831350F0F648 4
## 3 47C1359638497431975670B 4
## 4 8474358854651984137201C 4
## 5 DE8535960513435199406CE 4
For the ListingKey (“17A93590655669644DB4C06”) which appears six times in the dataset, I checked all six records and found that only the ProsperScore column differs among them. Indeed, this is the only difference in the data, and there is no column indicating when the ProsperScore was changed. So I treated these records as problematic and decided to remove them from the dataset.
# Collect the ListingKeys that appear more than once, then drop all of their rows
rm_keys <- count(dat, ListingKey, sort = TRUE) %>% filter(n > 1) %$% ListingKey
dat %<>% filter(!(ListingKey %in% rm_keys))
rm(rm_keys)
Before performing any analysis, it is important to check for missing values and decide whether they will significantly affect the results. So I calculated the missing-value ratio for each column of this dataset and printed the columns whose ratio exceeds 30% in the table below:
## # A tibble: 12 x 2
## variable na_ratio
## <chr> <dbl>
## 1 GroupKey 0.8811732
## 2 LoanFirstDefaultedCycleNumber 0.8489652
## 3 ScorexChangeAtTimeOfListing 0.8316450
## 4 TotalProsperLoans 0.8060656
## 5 TotalProsperPaymentsBilled 0.8060656
## 6 OnTimeProsperPayments 0.8060656
## 7 ProsperPaymentsLessThanOneMonthLate 0.8060656
## 8 ProsperPaymentsOneMonthPlusLate 0.8060656
## 9 ProsperPrincipalBorrowed 0.8060656
## 10 ProsperPrincipalOutstanding 0.8060656
## 11 CreditGrade 0.7420415
## 12 ClosedDate 0.5094129
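The na_ratio column above boils down to one pass over each column; here is a minimal sketch of the computation, written in Python for illustration (the report itself uses R), with toy records standing in for the loan data:

```python
# Sketch (illustrative Python; the report itself uses R): per-column
# missing-value ratio, keeping only the columns above 30%.
# The toy `rows` stand in for the loan records; the values are made up.
rows = [
    {"GroupKey": None, "ClosedDate": "2013-01-02", "LoanOriginalAmount": 4000},
    {"GroupKey": None, "ClosedDate": None, "LoanOriginalAmount": 9500},
    {"GroupKey": "g1", "ClosedDate": None, "LoanOriginalAmount": 15000},
]
columns = rows[0].keys()
na_ratio = {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}
high_na = {c: v for c, v in na_ratio.items() if v > 0.3}
```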
At first it looks like there are 12 variables whose missing-data ratio exceeds 30%. However, after reading the variable definitions, it turns out that an NA value has a specific meaning for almost all of these variables, so they are not really “missing values”. Therefore, I believe this is probably a pretty clean dataset, and it’s time to play with the data!
For the Univariate Plots Section, I choose the following questions for this project:
First of all, it’s useful to check the time period of the dataset:
## [1] "2005-11-09 20:44:28 UTC" "2014-03-10 12:20:53 UTC"
It turns out the time period of this dataset is from 2005-11-09 to 2014-03-10.
Now let’s see how many listings were created in each day:
There is clearly a big gap in this dataset, so we should consider whether to ignore the older part of the data. For this dataset, I decided to ignore the older data, for the following reasons:
the three credit variables ProsperRating (numeric), ProsperRating (Alpha) and ProsperScore are only available after 2009-07-13;
the recent data will be more valuable than the older data to make prediction in the future;
the time period of the recent data is more than four years, so it’s probably sufficient to build the model.
So I removed the data before 2009-07-13 from this dataset:
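The step amounts to a simple date comparison; here is a sketch in Python for illustration (the listing records are made up; the real filtering happens in R on the full data frame):

```python
from datetime import date

# Keep only listings created on or after 2009-07-13, the date from which
# ProsperRating (numeric), ProsperRating (Alpha), and ProsperScore exist.
CUTOFF = date(2009, 7, 13)
listings = [
    {"key": "a", "created": date(2008, 3, 1)},
    {"key": "b", "created": date(2010, 5, 20)},
    {"key": "c", "created": date(2013, 11, 7)},
]
recent = [rec for rec in listings if rec["created"] >= CUTOFF]
```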
It’s interesting to see that the most popular ListingCategory is “Debt Consolidation”, which appears to exceed the sum of all the other categories combined. The second most popular ListingCategory is “Other”, which provides little useful information about the purpose of the debt.
It is no surprise that the majority of the borrowers have a job, because people without a job have difficulty obtaining loans.
Summary information of EmploymentStatusDuration is:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 30.0 74.0 103.2 148.0 755.0 19
The figure above shows that the distribution of EmploymentStatusDuration is positively skewed, and its summary shows that about one quarter of borrowers have more than 12 years of employment and about one half have more than 6 years.
It shows that homeowners slightly outnumber those who don’t own a home.
We can see that the majority have verifiable income, and the two most popular income ranges are “$25,000-49,999” and “$50,000-74,999”.
Wow, the figure above looks bad: most of it is blank. However, it offers an important clue: some borrowers have a stated monthly income of more than $1,500,000.
Because this is such a large number, the first question that comes to mind is: “Is this income verified?”
To check this, let’s filter the records to find the borrowers with a monthly income above $1,500,000:
## # A tibble: 1 x 3
## IncomeVerifiable StatedMonthlyIncome LoanOriginalAmount
## <fctr> <dbl> <int>
## 1 False 1750003 4000
Aha, only one person has such a high income, and it is not verified. So, is this false information, or is this person really wealthy but unwilling to share the details?
After checking every field of this record, I found that this person has already completed 33 trades, has no record of delinquencies, and has a good credit score.
However, the LoanOriginalAmount is only 4,000, so I wonder why a person who earns more than $1,500,000 a month would need to borrow $4,000.
This is unreasonable.
So, for now, I would say that this StatedMonthlyIncome is probably not genuine, even though the person has a good credit history.
Now, let’s check the quantiles of StatedMonthlyIncome:
## 1% 25% 50% 75% 99%
## 167.2517 3416.6667 5000.0000 7083.3333 20833.3333
## 100%
## 1750002.9167
And plot the distribution of StatedMonthlyIncome without the lowest 1% and the highest 1%:
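The trimming step amounts to computing the two quantile cut points and filtering; here is a minimal Python sketch for illustration, with stand-in income values:

```python
# Trim observations below the 1st percentile or above the 99th percentile
# so extreme incomes no longer dominate the axis range. The values here
# are an illustrative stand-in for StatedMonthlyIncome.
incomes = sorted(range(1, 1001))
lo = incomes[int(0.01 * (len(incomes) - 1))]
hi = incomes[int(0.99 * (len(incomes) - 1))]
trimmed = [x for x in incomes if lo <= x <= hi]
```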
Now the visualisation looks more reasonable; it is also a positively skewed distribution.
The distribution of ProsperScore looks roughly normal. The three most frequent ProsperScores are 4, 6, and 8, and the two least frequent are 1 and 11. However, the variable definitions state that ProsperScore ranges from 1 to 10, with 10 being the best (lowest-risk) score. So I decided to double-check the ProsperScore field:
## # A tibble: 1 x 2
## ProsperScore n
## <dbl> <int>
## 1 11 1408
It seems there are many rows where ProsperScore equals 11, but 11 is not a valid score according to the variable definitions. So I suspect this may be a data error or something else, which needs to be investigated in further analysis.
From the Data variable definitions we know that ProsperRating (numeric) and ProsperRating (Alpha) should be consistent with each other, so let’s check it:
## # A tibble: 7 x 3
## `ProsperRating (numeric)` `ProsperRating (Alpha)` n
## <fctr> <fctr> <int>
## 1 1 HR 6901
## 2 2 E 9639
## 3 3 D 14070
## 4 4 C 17862
## 5 5 B 15166
## 6 6 A 14237
## 7 7 AA 5280
The result shows that they are consistent with each other, so we only need one of them; here I chose ProsperRating (Alpha) and renamed it to ProsperRating.
ProsperRating should have a strong relationship with ProsperScore, but first let’s check the ProsperScore when ProsperRating is NA:
## # A tibble: 1 x 3
## `is.na(ProsperScore)` `is.na(ProsperRating)` n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 83155
The result shows that when ProsperRating is NA, the ProsperScore is also NA, and vice versa. Pretty good!
Now let’s check the relationship between ProsperRating and ProsperScore:
The above heatmap shows a positive relationship between ProsperScore and ProsperRating. However, within each ProsperRating category there is obvious variation in ProsperScore, which means ProsperScore provides additional information alongside ProsperRating, so both variables will be kept for future use.
The above histogram shows that the distribution of BorrowerRate roughly follows a normal distribution; however, there is a spike at 0.32, which is a rather high BorrowerRate.
The above plot shows that the number of Prosper loans in CA is far greater than in any other state, followed by NY, TX, and FL; some states, such as WY, AK, and VT, have only a few loans.
The above plot clearly shows that most borrowers were either unwilling to provide their occupation or have an unusual occupation, because there are many “Other” values in the Occupation field.
The distributions of CreditScoreRangeLower and CreditScoreRangeUpper are quite similar; both are slightly positively skewed.
It’s interesting to see several peaks in the histogram at 4,000, 10,000, 15,000, 20,000, and 25,000, which is reasonable, since borrowers tend to request round amounts.
We can see that only a small fraction of the loans are in a “Past Due” status, and generally, the longer a loan is past due, the fewer such loans there are.
However, there are 5,336 “Chargedoff” loans and 1,005 “Defaulted” loans, which means the default risk is high and should concern investors (loans with a “Chargedoff” or “Defaulted” status make up about 7.63% of the total).
When I read the data variable definitions, I noticed that there is a column called LoanFirstDefaultedCycleNumber which indicates the cycle when the loan was charged off, and if the value of this column is NA, it means that the loan hasn’t been charged off. So I want to check whether this column is consistent with the LoanStatus column:
## # A tibble: 7 x 2
## LoanStatus LoanFirstDefaultedCycleNumber
## <fctr> <int>
## 1 Chargedoff NA
## 2 Chargedoff NA
## 3 Chargedoff NA
## 4 Chargedoff NA
## 5 Chargedoff NA
## 6 Chargedoff NA
## 7 Chargedoff NA
There are seven rows with a “Chargedoff” LoanStatus but an NA value in the LoanFirstDefaultedCycleNumber column, so these records are inconsistent.
The LoanStatus variable also has a possible value of “Defaulted”, and records with a “Defaulted” LoanStatus should likewise not have an NA LoanFirstDefaultedCycleNumber. Let’s check this:
## # A tibble: 100 x 2
## LoanStatus LoanFirstDefaultedCycleNumber
## <fctr> <int>
## 1 Defaulted NA
## 2 Defaulted NA
## 3 Defaulted NA
## 4 Defaulted NA
## 5 Defaulted NA
## 6 Defaulted NA
## 7 Defaulted NA
## 8 Defaulted NA
## 9 Defaulted NA
## 10 Defaulted NA
## # ... with 90 more rows
There are 100 inconsistent records. I suppose LoanFirstDefaultedCycleNumber is NA for some Chargedoff and Defaulted loans simply because the defaulted cycle number was not recorded correctly; even with an NA LoanFirstDefaultedCycleNumber, these loans are still defaulted loans, and the same applies to the Chargedoff loans.
So, for this project, I decided to define the defaulted loans as those that meet either of the criteria below:
LoanStatus is either Chargedoff or Defaulted; or,
LoanFirstDefaultedCycleNumber is not NA and LoanStatus is Completed.
Also, I will define the non-defaulted loans as those whose LoanStatus is Completed and LoanFirstDefaultedCycleNumber is NA, which means I only use the records whose LoanStatus is Chargedoff, Defaulted, or Completed.
Now, there are 83155 records left in the dataset.
First, let’s create a new variable to indicate whether a loan is defaulted or not:
dat %<>%
mutate(is_defaulted = as.factor(if_else(
(LoanStatus %in% c("Chargedoff", "Defaulted")) |
(LoanStatus == "Completed" & !is.na(LoanFirstDefaultedCycleNumber)),
"Defaulted", "Not Defaulted"
)))
Then, we can create a bar plot to show the percentage of the defaulted loans:
Furthermore, the distribution of the cycle of these defaulted loans may reveal more information:
It’s interesting to find that if a default did not occur in the very first cycle, it is unlikely to occur in the following three cycles.
There are 113,937 loans with 81 variables including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information in this dataset.
There are so many variables in this dataset that it is difficult to explore each one individually. Instead, it would be more appropriate to adopt a problem-oriented approach, which is what I chose for this project.
The main features of interest are LoanFirstDefaultedCycleNumber and LoanStatus, because in the loan business the most important concern is whether and when a loan will default (i.e., credit/default risk).
For the creditor/lender, what really matters is whether the borrower will default on a specific loan and when the default is most likely to occur.
For this project, I will focus only on whether the borrower will default. So I created a new feature named “is_defaulted” and chose it as the main feature of interest.
Of all the features in this dataset, I will use ProsperScore, ProsperRating, CreditScoreRangeLower, and CreditScoreRangeUpper, along with some other features (more details in the “Bivariate Plots Section”), to predict whether a specific loan will default.
Yes, I created is_defaulted from LoanFirstDefaultedCycleNumber and LoanStatus, and I modified some variables (changing factor levels, reordering factors, or renaming variables) to make better visualisations.
In addition, in the Bivariate Plots (Analysis) Section and Multivariate Plots (Analysis) Section, I will create more new variables to get a better understanding of the dataset and build better models.
Yes. For example, EmploymentStatusDuration and LoanFirstDefaultedCycleNumber do not follow a normal distribution; both are positively skewed.
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Yes, as mentioned before, I changed the form of some variables (changing factor levels or reordering factors) to make better visualisations, and I checked the time period of the dataset to choose a subset of the data. I also checked the consistency of some related features, such as LoanStatus and LoanFirstDefaultedCycleNumber.
After checking the dataset, I decided to use only the records whose LoanStatus is Chargedoff, Defaulted, or Completed. As a result, only about 73% of the original dataset was chosen for further analysis and model building.
In this section, I will mainly explore the relationships between the target feature and the supporting features, including the relationship between the target feature and a single supporting feature as well as with multiple supporting features. In addition, the relationships among the supporting features will be explored to obtain more insights and address potential multicollinearity problems in the modelling stage.
The first and most important step is defining the target feature properly, which was already done in the previous section.
Generally, whether a loan will default has a strong relationship with the borrower’s credit status. In this dataset, there are two main variables which indicate the borrower’s credit: ProsperRating and ProsperScore.
Now it’s time to visualise the relationship between the default status and ProsperScore and ProsperRating. First, let’s check ProsperRating:
It is no surprise that the percentage of defaulted loans has an inverse relationship with ProsperRating; that is, a better ProsperRating means a lower default rate.
Second, let’s check the ProsperScore:
From this figure we can see that the percentage of defaulted loans does not have an exact inverse relationship with ProsperScore; there is an exception around a ProsperScore of 4.
Besides ProsperScore and ProsperRating, other features may be related to the default rate. A rigorous way to identify all potentially useful features would be to dig into each feature one by one, or to divide the features into groups by type and use screening techniques to pick out the most useful ones. However, to keep this report readable, I will instead quickly create a correlation matrix plot for the numeric variables to see whether there are highly correlated features in the dataset.
From the above plot we see that some features are very highly correlated with each other. Highly correlated features probably share a lot of information and may give rise to multicollinearity problems in the modelling stage.
In this project, I used the correlation matrix plot as a reference and then chose the features (only a few here, not the full set one would use for a real-life project) that may be related to the default rate, based on domain knowledge:
BorrowerRate: the higher the BorrowerRate, perhaps the higher the default probability;
BorrowerState: the default rate may differ between states due to their different economic conditions;
EmploymentStatus and EmploymentStatusDuration: borrowers without a job may be more likely to default;
Occupation: different occupations may have different default probabilities;
IsBorrowerHomeowner: the default probability may differ between those who own a home and those who don’t;
CreditScoreRangeLower and CreditScoreRangeUpper: provide information similar to ProsperScore and ProsperRating;
IncomeRange: borrowers in different income ranges may have different default probabilities;
IncomeVerifiable: borrowers with verified income may have a lower default probability;
LoanOriginalAmount: loans of different original amounts may have different default probabilities;
ListingCategory: loans in different ListingCategories may have different default probabilities.
There are two variable types here: categorical variables and numeric variables.
For the categorical variables, I will create a bar plot showing the default rate for each level (using a custom function named bar_plot). For the numeric variables, I will create a boxplot combined with a violin plot showing the distribution of the variable separately for defaulted and non-defaulted loans (using a custom function named violin_boxplot).
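What the custom bar_plot function summarises can be reduced to a grouped default rate; here is a minimal sketch in Python for illustration, with made-up (level, defaulted) records:

```python
from collections import defaultdict

# Default rate per level of a categorical variable: count defaults and
# totals per level, then divide. The records below are illustrative.
records = [
    ("Employed", False), ("Employed", True),
    ("Not employed", True), ("Not employed", True),
]
tally = defaultdict(lambda: [0, 0])  # level -> [defaulted, total]
for level, defaulted in records:
    tally[level][0] += defaulted
    tally[level][1] += 1
default_rate = {level: d / n for level, (d, n) in tally.items()}
```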
It seems there are obvious differences among the states. Let’s also create a map to see whether there is a spatial pattern in the default rate:
We can see that the default ratio of borrowers whose EmploymentStatus is either “Not employed” or “Other” is higher than that of employed borrowers. We should also note that the number of loans varies widely across the EmploymentStatus groups.
We can see that different occupations have different default ratios, and the default ratio for sophomore college students exceeds 80%.
We can see that the default ratio of borrowers who own a home is slightly lower than that of those who don’t.
We can clearly see that higher income means a lower default ratio.
We can see that the default ratio of borrowers with verified income is slightly lower than that of those whose income is not verified.
We can see that different ListingCategories have quite different default ratios.
We can see that the higher the BorrowerRate, the higher the default ratio.
We can see that EmploymentStatusDuration has no clear relationship with the default ratio.
We can see that the lower the CreditScoreRangeLower and CreditScoreRangeUpper, the higher the default ratio.
We can see that loans with smaller LoanOriginalAmount tend to be more likely to default. Interesting.
I explored the relationships between the target variable is_defaulted and 14 predictors, including 8 categorical variables and 6 numeric variables.
How did the feature(s) of interest vary with other features in the dataset?
The feature of interest varies with many other features in this dataset. Of the 14 variables I chose for this project, almost all show some relationship with the target variable, except for EmploymentStatusDuration, based on the bar plots and violin-boxplots. For example, the plots above show that the default rate has a clear relationship with ProsperScore, ProsperRating, EmploymentStatus, Occupation, IncomeRange, and BorrowerRate.
Yes, there is an interesting relationship between IncomeRange and IncomeVerifiable:
It clearly shows that most borrowers who are either not employed or have zero income (perhaps just unwilling to provide their income status) have unverified income. Also, from the relationship between is_defaulted and IncomeVerifiable, we see that borrowers with unverified income have a higher default rate. So we can use these variables to predict the default rate for new loans.
The strongest relationship I found is between is_defaulted and ProsperRating: a better ProsperRating means a lower default rate, without exception, which makes ProsperRating a very important factor to consider when investing in a Prosper loan.
Also, I found an interesting pattern for the college student group:
The above figure shows that for college students, a higher grade generally means a lower default rate, with the exception of sophomore students. The labels in the figure give the sample size of each group: about 10 of the 12 sophomore students defaulted, which is a really high ratio.
From previous sections we found that the default rate has a strong relationship with ProsperRating and ProsperScore. The default rate should also change over time, so I decided to create a time series plot of both the default rate and the ProsperScore:
It clearly shows that, as time went by, the ProsperScore became lower and lower while the DefaultRatio became higher and higher.
The model part has three steps:
The target variable is is_defaulted, and I chose 13 features as the input variables of the model.
After the data preprocessing in the previous sections, there are now 25,966 rows left in the dataset for building the model.
In order to evaluate the performance of the models, we need to split the dataset into two parts: a training set and a test set. Here I used 80% of the data to train each model and the remaining 20% to evaluate it.
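The split can be sketched as a shuffled index partition; this is an illustrative Python version (the seed is an arbitrary choice, and the row count matches the one reported above):

```python
import random

# Random 80/20 split of row indices. Fixing the seed keeps the split
# reproducible; 25,966 matches the preprocessed dataset size above.
random.seed(42)
n_rows = 25966
indices = list(range(n_rows))
random.shuffle(indices)
cut = int(0.8 * n_rows)  # 20,772 rows go to the training set
train_idx, test_idx = indices[:cut], indices[cut:]
```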
In this project I tried Logistic Regression and Support Vector Machine to build the model:
Logistic Regression Model
Notice: change the eval option to TRUE if the model needs to be updated.
model_glm <- glm(formula = is_defaulted ~ ., data = train_set, family = binomial)
save(model_glm, file = "data/model_glm.rda")
Support Vector Machine
Notice: change the eval option to TRUE if the model needs to be updated.
model_svm <- svm(is_defaulted ~ ., data = train_set, probability = TRUE)
save(model_svm, file = "data/model_svm.rda")
First, let’s visualise the performance of the Logistic Regression model:
Then, let’s visualise the performance of the Support Vector Machine model:
We can see that neither the logistic regression model nor the support vector machine model is good: no matter which threshold we choose to decide whether a loan will default, the results are not satisfactory.
For the classification task, the most relevant performance metrics are Precision, Recall, and F1 Score:
One thing to keep in mind is that we generally need to make a trade-off between Precision and Recall, so whether to choose the model with higher precision or higher recall depends on the actual problem and our goal.
For example, if we want a higher recall, we can choose a lower threshold for the model; in fact, if we simply predict “Positive” for every sample, the recall will equal 1. And if we want higher precision, we can choose a higher threshold. In order to compare models with a single number, we can use the F1 Score:
\[F1\;Score = \frac{2\cdot Precision\cdot Recall}{Precision + Recall}\]
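As a quick sanity check, the formula can be written out directly (Python for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Unlike the arithmetic mean, the harmonic mean punishes imbalance: f1_score(1.0, 0.1) is roughly 0.18, far below the arithmetic mean of 0.55.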
First, let’s plot the F1 Score vs. Thresholds for each model we built:
Then, let’s plot the Precision-Recall curve for each model we built:
Finally, let’s plot the ROC curve and calculate the AUC:
ROC curve of Logistic Regression Model is:
and the AUC for the Logistic Regression model is 0.7210596.
ROC curve for Support Vector Machine is:
and the AUC for the Support Vector Machine is 0.6817933.
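The AUC values above can be read as a probability: the chance that a randomly chosen defaulted loan receives a higher predicted risk than a randomly chosen non-defaulted loan. A minimal sketch of this rank-based (Mann-Whitney) view, in Python for illustration:

```python
def auc(pos_scores, neg_scores):
    """Mann-Whitney form of AUC: the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half a win."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```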
From the plot in the Multivariate Plots Section, we noticed that as time went by, the ProsperScore became lower and lower and the DefaultRatio became higher and higher, which means investors should address the default risk of the Prosper loan business and be very careful when investing in Prosper loans.
The most interesting pattern I found is the default rate within the student group. Generally, a higher grade means a lower default rate; however, this does not hold for sophomore students: about 10 out of 12 sophomore students defaulted, which is a really high ratio.
Discuss the strengths and limitations of your model.
Yes, I created two models with this dataset: Logistic Regression and Support Vector Machine.
From the model evaluation we can see that both models have some predictive power for the default rate; however, that power is weak. One reason may be that I have not tuned the parameters of these models; I also think feature engineering is very important for building a better model.
Plot one indicates a strong relationship between IncomeRange and the default ratio. For borrowers who have no income or are not employed, the default ratio is very high. Also, there is an exact inverse relationship between income and the default ratio: the higher the income, the lower the default ratio.
Plot two reveals a very interesting pattern in the default rate across college student groups: sophomore college students have a higher default rate than any other college student group, followed by freshmen, juniors, seniors, and graduate students.
Additionally, the number of loans trends in the opposite direction.
We can also notice that students from community colleges have a higher default ratio than those from technical schools; however, the sample contains just one technical-school student, so this has no statistical meaning.
Plot three shows an inverse relationship between ProsperScore and default rate, which is no surprise. It also shows how these two indicators change over time.
An important finding is that as time goes by:
The conclusion is that the risk of the Prosper loan business has become higher than in its early days, so investors should be prudent and thoughtful when making new investments on this platform.
After exploring this dataset, I found a lot of insights about the loan business. Most of them simply confirm knowledge we already have: for example, a better credit score and a higher income lead to a lower default ratio, and different states or occupations have different default ratios.
Besides, I found two interesting insights I didn’t know before:
For college students, more and more of them chose to borrow money as their time in college went by, and the higher the grade, the lower the default ratio, with the exception of sophomore students. So why do sophomore students have such a high default ratio? This is an interesting question, and I would like to research it in the future.
The risk of the Prosper loan business has become higher than in its early days, as evidenced by the higher default rate and lower ProsperScore.
Besides the exploratory analysis and visualisation of this dataset, I also built two models to try to predict the default probability of a given loan based on the features of the borrower as well as the features of the loan itself.
However, the performance of the models was not good, for the following reasons:
Feature selection and feature engineering were not performed, and both play an important role in machine learning models. In fact, the selected features may be collinear, because we got the message “prediction from a rank-deficient fit may be misleading” from predict.lm(object, newdata, se.fit, scale = 1, ... (perhaps we could also try deep learning to automate the feature engineering process);
Model parameters were not tuned to reach the models’ potential best performance, and other models such as GBRT and Random Forest should be tried as well.
Also, there are several issues that we should address in the future:
In the model training part, I randomly chose 80% of the data as training data and used the remaining 20% as test data, which seems reasonable at first. However, if we think carefully about the target problem, we may notice that once we have a model, we need to use it to predict the future outcome of currently unfinished loans. So we are forecasting the future, and for this kind of problem, the test dataset should be the recent part of the data and the training dataset the older part. We could also use a moving time window to generate the training and test datasets;
Predicting whether a loan will default is a really difficult task, because it means predicting the future, and the future is full of uncertainty. For example, even a borrower with a low income may repay his or her debt on time if he or she has the ability to clear the debt when it matures; conversely, even a borrower with a high income may default if he or she runs into financial difficulties when the debt matures. In addition, even for the same borrower, the default probability may differ between loans simply because the consequences of default differ (e.g., a mortgage loan vs. a credit card).
From final plot 3 we see that the default ratio of all loans changes over time, which means that besides the features of the borrower and of the loan itself, some other time-related factors affect the default probability: the macroeconomic conditions and monetary policy of different time periods. Therefore, if we really want to build a model that can be used in the real world, we should take macroeconomic conditions and monetary policy into consideration, as well as the features of the borrowers and the loans.
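The time-ordered split described in the first issue above can be sketched as follows (Python for illustration; the dates and default labels are made up):

```python
# Train on the older 80% of loans and test on the most recent 20%,
# mirroring how the model would actually be applied to future loans.
# Each record is an illustrative (origination date, defaulted) pair.
loans = [
    ("2013-09-30", 0), ("2010-01-05", 1), ("2014-01-15", 1),
    ("2011-06-01", 0), ("2012-02-11", 1),
]
loans.sort(key=lambda rec: rec[0])  # ISO dates sort chronologically
cut = int(0.8 * len(loans))
train, test = loans[:cut], loans[cut:]
```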